The present invention relates to a method for calculating similarity among documents and, more particularly, to a method for searching a document database for a document whose contents are similar to those of a document designated by a searcher.
A similar document search technique is known as a technique for searching for or retrieving an intended document from a large number of electronic documents. JP-A-2002-73681 describes such a technique, in which a document designated by a searcher (hereinafter referred to as a source document) and a document stored in a document database (hereinafter referred to as a registered document) are each expressed as a vector whose elements represent appearance information, such as the frequency of appearance, of the words contained in the document (hereinafter referred to as a characteristic vector). A distance between the characteristic vectors is then calculated as the similarity between the documents.
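By way of illustration only, the conventional calculation may be sketched as follows. The sketch assumes whitespace tokenization, raw word counts as the appearance information, and cosine similarity as the measure between characteristic vectors; the cited publication is not stated here to use these particular choices, and the function names characteristic_vector, cosine_similarity and rank_registered_documents are hypothetical.

```python
import math
from collections import Counter


def characteristic_vector(text):
    """Hypothetical helper: one element per word, valued by its frequency of appearance."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Assumed measure: cosine of the angle between two characteristic vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank_registered_documents(source_text, registered_texts):
    """Return (similarity, index) pairs, most similar registered document first."""
    source_vec = characteristic_vector(source_text)
    scored = [
        (cosine_similarity(source_vec, characteristic_vector(text)), index)
        for index, text in enumerate(registered_texts)
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    registered = ["engine noise reduction", "tax law"]
    for similarity, index in rank_registered_documents("engine noise", registered):
        print(index, round(similarity, 3))
```

In this sketch the source document is compared against every registered document in turn; a practical system would presumably precompute and index the characteristic vectors of the registered documents.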
In the aforementioned conventional technique, however, the characteristic vector is formed so as to have one element for the appearance information of each word appearing in the documents. Therefore, when one concept is expressed by a plurality of words, each of those words contributes a separate element, and the similarity is calculated with that concept being emphasized. As a result, the search or retrieval may give a result that does not match the intention of the searcher.
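The emphasis described above may be illustrated with a small worked example. It assumes raw word counts and cosine similarity, which are illustrative choices rather than details taken from the conventional technique, and the sample documents and the function name word_level_similarity are hypothetical. The source document mentions two concepts: a car, expressed by three different words, and a brake, expressed by one word. Each registered document shares exactly one concept with the source document.

```python
import math
from collections import Counter


def word_level_similarity(text_a, text_b):
    """Hypothetical helper: cosine similarity with one vector element per word."""
    u = Counter(text_a.lower().split())
    v = Counter(text_b.lower().split())
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Source document: the "car" concept appears as three words, "brake" as one.
source = "car automobile vehicle brake"

# Each registered document shares exactly one concept with the source.
shares_car_concept = "car automobile vehicle tire"
shares_brake_concept = "brake pedal lever cable"

sim_car = word_level_similarity(source, shares_car_concept)
sim_brake = word_level_similarity(source, shares_brake_concept)

print(sim_car)    # 0.75
print(sim_brake)  # 0.25
```

Although each registered document matches one concept of the source document, the document matching the concept that is expressed by three words scores three times higher, so a searcher primarily interested in the brake would see the less relevant document ranked first.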